This project analyzes customer purchasing behavior by segmenting products into two cost groups using clustering techniques. Additionally, RFM(Recency, Frequency, Monetary) analysis is conducted to classify customers based on purchasing activity and value. The study identifies high- and low-value customer segments, frequent purchase patterns, and product associations using association rule mining (Apriori Algorithm). Insights from these Analysis enable data-driven decision-making, such as personalized marketing strategies, product bundling, and inventory optimization. The findings are visualized through an interactive Shiny dashboard, providing actionable intelligence for business growth.
con <- dbConnect(odbc::odbc(),
Driver = "ODBC Driver 17 for SQL Server",
Server = "localhost",
Database = "porfolio",
Trusted_Connection = "Yes")
df <- dbGetQuery(con, "SELECT * FROM SalesB")
dbDisconnect(con)
df_copy<-df
DATA SUMMARY: The dataset consists of 522,316
transactions, including InvoiceNo, StockCode,
Description, Quantity,
InvoiceDate, UnitPrice,
CustomerID, and Country. Transactions span
from December 2010 to December 2011. The Quantity ranges from 1 to 500,
while UnitPrice varies from 0.001 to 649.5. There are
131,420 missing CustomerIDs, indicating anonymous purchases. The dataset
captures global transactions across multiple countries.
head(df)
## InvoiceNo StockCode Description Quantity
## 1 536592 22489 PACK OF 12 TRADITIONAL CRAYONS 2
## 2 536592 22499 WOODEN UNION JACK BUNTING 1
## 3 536592 22501 PICNIC BASKET WICKER LARGE 2
## 4 536592 22503 CABIN BAG VINTAGE PAISLEY 1
## 5 536592 22508 DOORSTOP RETROSPOT HEART 2
## 6 536592 22530 MAGIC DRAWING SLATE DOLLY GIRL 3
## InvoiceDate UnitPrice CustomerID Country
## 1 2010-12-01 17:06:00 0.85 NA United Kingdom
## 2 2010-12-01 17:06:00 12.72 NA United Kingdom
## 3 2010-12-01 17:06:00 21.23 NA United Kingdom
## 4 2010-12-01 17:06:00 59.53 NA United Kingdom
## 5 2010-12-01 17:06:00 7.62 NA United Kingdom
## 6 2010-12-01 17:06:00 0.85 NA United Kingdom
summary(df)
## InvoiceNo StockCode Description Quantity
## Length:522316 Length:522316 Length:522316 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Mode :character Median : 4.000
## Mean : 9.631
## 3rd Qu.: 11.000
## Max. :500.000
##
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00.00 Min. : 0.001 Min. :12347
## 1st Qu.:2011-03-28 12:13:00.00 1st Qu.: 1.250 1st Qu.:13969
## Median :2011-07-20 11:55:00.00 Median : 2.080 Median :15157
## Mean :2011-07-04 16:52:47.49 Mean : 3.289 Mean :15294
## 3rd Qu.:2011-10-19 11:49:00.00 3rd Qu.: 4.130 3rd Qu.:16794
## Max. :2011-12-09 12:50:00.00 Max. :649.500 Max. :18287
## NA's :131420
## Country
## Length:522316
## Class :character
## Mode :character
##
##
##
##
This step examines the distribution and relationship between Quantity and UnitPrice. The Anderson-Darling normality test results (p-value < 2.2e-16) confirm that both variables deviate significantly from a normal distribution. Boxplots highlight the presence of outliers, particularly for high UnitPrice values. A Spearman correlation of -0.41 suggests a moderate negative, non-linear relationship. Scatter plots reveal that as price increases, quantity purchased drops sharply at first, then levels off, indicating that customers are highly sensitive to price changes at lower values but less reactive at higher prices. A smoothed trend line further confirms this diminishing impact of price on quantity.
1. Normality Test
# Anderson-Darling test for Quantity
ad.test(df$Quantity)
##
## Anderson-Darling normality test
##
## data: df$Quantity
## A = 95707, p-value < 2.2e-16
# Anderson-Darling test for Quantity
ad.test(df$UnitPrice)
##
## Anderson-Darling normality test
##
## data: df$UnitPrice
## A = 50549, p-value < 2.2e-16
2. Identifying Outliers with Boxplots:
boxplot(df$UnitPrice, main = "Boxplot of UnitPrice", col = "lightblue")
boxplot(df$Quantity, main = "Boxplot of Quantity", col = "lightblue")
3. Checking Correlation Between Quantity and Unit Price:
cor(df$Quantity, df$UnitPrice, method = "spearman", use = "complete.obs") # non-linear
## [1] -0.4063089
ggplot(df, aes(x = UnitPrice, y = Quantity)) +
geom_point() +
coord_cartesian(xlim = c(0, 30), ylim = c(0, 100)) + # zoom into a specific range
geom_smooth()+
theme_minimal() +
labs(title = "Zoomed-In Scatter Plot", x ="UnitPrice" , y = "Quantity")
The analysis explored purchasing trends by clustering products into Low Cost and High Cost groups using K-means clustering. After normalizing prices, products were classified based on their average price. A Kruskal-Wallis test confirmed significant differences in purchase quantities across clusters. A box plot (log scale) visualized these variations, while regression analysis showed a stronger negative price-quantity relationship in the Low Cost group. The top 10 best-selling products per cluster were identified and plotted, highlighting distinct purchasing behaviors. This analysis provides insights into how price influences demand and helps businesses optimize pricing strategies.
# Select relevant columns
df_cluster <- df %>%
select(StockCode, UnitPrice) %>%
group_by(StockCode) %>%
summarise(AvgPrice = mean(UnitPrice), .groups = "drop")
# Normalize Unit Price (to ensure fair clustering)
df_cluster$ScaledPrice <- scale(df_cluster$AvgPrice)
set.seed(123) # Ensure reproducibility
kmeans_result <- kmeans(df_cluster$ScaledPrice, centers = 2, nstart = 25)
# Assign clusters
df_cluster$Cluster <- factor(kmeans_result$cluster, labels = c("Low Cost", "High Cost"))
Visualization of Clusters:
Plotted product clusters based on average price using a scatter plot, distinguishing Low Cost and High Cost products.
ggplot(df_cluster, aes(x = AvgPrice, y = Cluster, color = Cluster)) +
geom_point(alpha = 0.7) +
theme_minimal() +
labs(title = "Product Clustering by Price", x = "Average Price", y = "Cluster")
Quantity Comparison by Cluster:
Used a box plot (log scale) to compare the quantity sold in each cluster, revealing distribution differences.
ggplot(df, aes(x = Cluster, y = Quantity)) +
geom_boxplot() +
scale_y_log10() + # Log scale to compress extreme values
theme_minimal() +
labs(title = "Box Plot of Quantity by Cluster (Log Scale)",
x = "Cluster", y = "Quantity (Log Scale)")
Statistical Testing:
# Kruskal-Wallis test for comparing Quantity across Clusters
kruskal_test <- kruskal.test(Quantity ~ Cluster, data = df)
# Print the result directly
print(kruskal_test)
##
## Kruskal-Wallis rank sum test
##
## data: Quantity by Cluster
## Kruskal-Wallis chi-squared = 155.22, df = 1, p-value < 2.2e-16
df %>%
group_by(Cluster) %>%
do(model = lm(Quantity ~ UnitPrice, data = .)) %>%
summarise(Cluster, Intercept = coef(model)[1], Slope = coef(model)[2])
## # A tibble: 2 × 3
## Cluster Intercept Slope
## <fct> <dbl> <dbl>
## 1 Low Cost 12.6 -0.918
## 2 High Cost 4.02 -0.0168
Top-Selling Products per Cluster:
The analysis segmented customers using RFM (Recency, Frequency, Monetary) analysis and classified them into High Value and Low Value groups based on Customer Lifetime Value (CLV). The top 10 countries for each segment were identified and visualized. The most frequently purchased products by high and low-value customers were analyzed. A time-series analysis of purchase trends showed variations in spending patterns over time. These insights help businesses target high-value customers effectively and optimize product offerings and marketing strategies.
# Create a snapshot date (latest date in the dataset for recency calculation)
snapshot_date <- max(df$InvoiceDate)
# Calculate Recency, Frequency, and Monetary
rfm_data <- df %>%
group_by(CustomerID) %>%
summarise(
Recency = as.numeric(difftime(snapshot_date, max(InvoiceDate), units = "days")), # Days since last purchase
Frequency = n_distinct(InvoiceNo), # Number of purchases
Monetary = sum(Quantity * UnitPrice) # Total spending
)
# Scale the RFM values
rfm_data <- rfm_data %>%
mutate(
Recency_Scaled = scale(Recency),
Frequency_Scaled = scale(Frequency),
Monetary_Scaled = scale(Monetary)
)
# View the result
#head(rfm_data)
Segment Customers:
# Calculate Average Order Value (AOV)
avg_order_value <- mean(rfm_data$Monetary)
# Calculate CLV by customer segment
rfm_data <- rfm_data %>%
mutate(
CLV = Monetary * Frequency # A simple estimate of CLV (you can refine this model)
)
# Classify customers as High or Low Value based on CLV
rfm_data <- rfm_data %>%
mutate(
CLV_Segment = ifelse(CLV > median(CLV), "High Value", "Low Value")
)
# View the result
head(rfm_data)
## # A tibble: 6 × 9
## CustomerID Recency Frequency Monetary Recency_Scaled[,1] Frequency_Scaled[,1]
## <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 12347 1.87 7 4310. -0.902 0.110
## 2 12348 75.0 4 1437. -0.172 -0.0251
## 3 12349 18.1 1 1458. -0.740 -0.161
## 4 12350 310. 1 294. 2.17 -0.161
## 5 12352 35.9 7 1386. -0.562 0.110
## 6 12353 204. 1 89.0 1.12 -0.161
## # ℹ 3 more variables: Monetary_Scaled <dbl[,1]>, CLV <dbl>, CLV_Segment <chr>
The approach involves preparing transaction data by grouping product descriptions for each invoice and filtering transactions with more than three items. The items are then split and converted into a “transactions” format for Association Rule Mining. The Apriori algorithm is applied with specified thresholds for support and confidence to uncover frequent itemsets. The left-hand side (lhs) and right-hand side (rhs) of the rules are extracted, representing commonly co-occurring products. The frequency of products is calculated and sorted to identify popular items bought together. Visualizations are created to explore these relationships, providing insights for marketing or sales strategies based on product associations.
# View the rules
inspect(head(rules))
## lhs rhs support confidence coverage lift count
## [1] {CHILDRENS CUTLERY DOLLY GIRL } => {CHILDRENS CUTLERY SPACEBOY } 0.01122283 0.7594502 0.01477757 41.541924 221
## [2] {CHILDRENS CUTLERY SPACEBOY } => {CHILDRENS CUTLERY DOLLY GIRL } 0.01122283 0.6138889 0.01828154 41.541924 221
## [3] {CHILDRENS CUTLERY POLKADOT BLUE} => {CHILDRENS CUTLERY POLKADOT PINK} 0.01117205 0.7612457 0.01467601 35.438416 220
## [4] {CHILDRENS CUTLERY POLKADOT PINK} => {CHILDRENS CUTLERY POLKADOT BLUE} 0.01117205 0.5200946 0.02148080 35.438416 220
## [5] {PAINTED METAL PEARS ASSORTED} => {ASSORTED COLOUR BIRD ORNAMENT} 0.01315255 0.7000000 0.01878936 9.493388 259
## [6] {PINK HAPPY BIRTHDAY BUNTING} => {BLUE HAPPY BIRTHDAY BUNTING} 0.01330490 0.6517413 0.02041438 32.907922 262
## Product Frequency
## 1 {JUMBO BAG RED RETROSPOT} 86
## 2 {RED RETROSPOT CHARLOTTE BAG} 45
## 3 {JUMBO STORAGE BAG SUKI} 42
## 4 {LUNCH BAG RED RETROSPOT} 41
## 5 {JUMBO BAG PINK POLKADOT} 36
## 6 {CHARLOTTE BAG SUKI DESIGN} 31
## 7 {JUMBO SHOPPER VINTAGE RED PAISLEY} 30
## 8 {LUNCH BAG BLACK SKULL.} 29
## 9 {WOODLAND CHARLOTTE BAG} 27
## 10 {CHARLOTTE BAG PINK POLKADOT} 24
library(arulesViz)
plot(rules, engine = "plotly")
plot(subrules, method = "graph", engine = "htmlwidget")